- Why not linear regression?
- Logistic regression
- Multinomial logistic regression
- Evaluating accuracy
- Bayes theorem, LDA, QDA
- k Nearest Neighbours
9/24/2020
You can think of logistic regression as a parametric model for \[\mathbb{P}(Y = y \mid \mathbf{X} = \mathbf{x}),\] just as linear regression was a parametric model for \[\mathbb{E}(Y \mid \mathbf{X} = \mathbf{x}).\]
Binary logistic regression, in particular, directly models the conditional probability that \(Y = 1\) given \(\mathbf{X}\), via \[\mathbb{P}(Y_{i} = 1 \mid \mathbf{X}_{i} = \mathbf{x}_{i}) = F(\beta_{0} + \beta_{1} x_{i1} + \dots + \beta_{p} x_{ip}),\] where \(F(z) = e^{z} / (1 + e^{z})\) is the logistic function.
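A minimal sketch of this mapping from a linear score to a probability, with made-up coefficients \(\beta\) purely for illustration (in practice they are estimated by maximum likelihood):

```python
import numpy as np

def logistic(z):
    # F(z) = e^z / (1 + e^z), maps any real-valued score to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative coefficients (beta_0, beta_1, beta_2) -- not estimated from data
beta = np.array([-1.0, 2.0, 0.5])

x = np.array([0.3, -1.2])                  # one observation with p = 2 predictors
p_hat = logistic(beta[0] + x @ beta[1:])   # P(Y = 1 | X = x) under the model
print(p_hat)
```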
Bayes Theorem simply says that if I give you \(p(y)\) and \(p(x \mid y)\), you can compute \(p(y \mid x)\). Remember the basic rules of probability?
\[\begin{aligned} \mathbb{P}(Y = k \mid X = x) &= \dfrac{\mathbb{P}(X = x \text{ and } Y = k)}{\mathbb{P}(X = x)} \\ &= \dfrac{\mathbb{P}(X = x \mid Y = k) \mathbb{P}(Y = k)}{\sum_{l=1}^K \mathbb{P}(X = x \mid Y = l) \mathbb{P}(Y = l)} \\ &= \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \end{aligned}\]
\[\underbrace{p_k(x)}_\text{posterior} = \mathbb{P}(Y = k \mid X = x) = \dfrac{\pi_k f_k(x)}{\sum_{l=1}^K \pi_l f_l(x)} \propto \underbrace{\pi_k}_\text{prior} \cdot \underbrace{f_k(x)}_\text{likelihood}\]
Posterior \(\propto\) prior \(\times\) likelihood
Events:
- \(A\): the person has Covid, with prevalence \(\mathbb{P}(A) = 0.01\)
- \(B\): the test comes back positive, with sensitivity \(\mathbb{P}(B \mid A) = 0.95\) and false-positive rate \(\mathbb{P}(B \mid \sim A) = 0.01\)
Goal: find \(\mathbb{P}(A \mid B)\), i.e. the probability of having Covid given a positive test, using
\[ \begin{aligned} \mathbb{P}(A \mid B) &= \dfrac{\mathbb{P}(B \mid A)\, \mathbb{P}(A)}{\mathbb{P}(B \mid A)\, \mathbb{P}(A) + \mathbb{P}(B \mid \sim A)\, \mathbb{P}(\sim A)} \\ &= \dfrac{0.95 \cdot 0.01}{0.95 \cdot 0.01 + 0.01 \cdot 0.99} \\ &\approx 0.49 \end{aligned}\]
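The same arithmetic as a quick check:

```python
# Bayes' theorem for the Covid example: P(A | B) = P(B | A) P(A) / P(B)
prior = 0.01       # P(A): prevalence of Covid
sens = 0.95        # P(B | A): probability of a positive test given Covid
false_pos = 0.01   # P(B | ~A): probability of a positive test without Covid

evidence = sens * prior + false_pos * (1 - prior)   # P(B), by total probability
posterior = sens * prior / evidence                 # P(A | B)
print(round(posterior, 2))                          # approx. 0.49
```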
The Gaussian density has the form
\[f_k(x) = \dfrac{1}{\sqrt{2 \pi \sigma_k^2}} e^{- \frac{1}{2} \left( \frac{x - \mu_k}{\sigma_k} \right)^2}\]
Here \(\mu_k\) is the mean, and \(\sigma_k^2\) the variance, for the class \(k\).
LDA assumes that all the \(\sigma_k^2\) are the same, i.e. \(\sigma_k^2 = \sigma^2 \ \forall k\). Plugging this into Bayes formula, we get \[\mathbb{P}(Y = k \mid X = x) \propto \pi_k \dfrac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{1}{2} \left( \frac{x - \mu_k}{\sigma} \right)^2}\]
To classify at the value \(X = x\), we need to see which of the curves is largest. Taking logs, and discarding terms that do not depend on \(k\), we see that this is equivalent to assigning \(x\) to the class with the largest discriminant score
\[\delta_k(x) = x \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2 \sigma^2} + \log(\pi_k)\]
Note that \(\delta_k(x)\) is a linear function of \(x\). If there are \(K = 2\) classes and \(\pi_0 = \pi_1 = 0.5\), then the decision boundary is at \(x = \dfrac{\mu_0 + \mu_1}{2}\), as the short derivation below shows.
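To see this, set \(\delta_0(x) = \delta_1(x)\); the \(\log \pi_k\) terms cancel because \(\pi_0 = \pi_1\), leaving (for \(\mu_0 \neq \mu_1\))
\[x \frac{\mu_0}{\sigma^2} - \frac{\mu_0^2}{2 \sigma^2} = x \frac{\mu_1}{\sigma^2} - \frac{\mu_1^2}{2 \sigma^2} \iff x (\mu_0 - \mu_1) = \frac{\mu_0^2 - \mu_1^2}{2} \iff x = \frac{\mu_0 + \mu_1}{2}.\]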
Typically we do not know the parameters for the normal distributions under each class, but we have training data. We can simply estimate the parameters and plug them into the decision rule.
The prior terms can be estimated via the sample proportions: \[\hat{\pi}_k = \dfrac{n_k}{n}\] We can estimate mean and variances with the corresponding sample statistics.
\[\begin{aligned} &\hat{\mu}_k = \dfrac{1}{n_k} \sum_{i : y_i = k} x_i \\ &\hat{\sigma}_{k}^{2} = \dfrac{1}{n_{k} - 1} \sum_{i : y_i = k} (x_i - \hat{\mu}_k)^2 \end{aligned}\]
Since LDA assumes common variance, we need an estimate: \(\hat{\sigma}^2\) is known as pooled variance \[\hat{\sigma}^2 = \dfrac{1}{n - K} \sum_{k = 1}^K (n_k - 1) \hat{\sigma}_k^2 \]
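A minimal sketch of these plug-in estimates and of the resulting classification rule, assuming the training data are stored in numpy arrays `x` (one predictor) and `y` (integer class labels):

```python
import numpy as np

def lda_fit(x, y):
    """Plug-in estimates for one-dimensional LDA: priors, class means, pooled variance."""
    classes = np.unique(y)
    n, K = len(x), len(classes)
    pi_hat = np.array([np.mean(y == k) for k in classes])    # n_k / n
    mu_hat = np.array([x[y == k].mean() for k in classes])   # class-specific sample means
    # pooled variance: (1 / (n - K)) * sum_k (n_k - 1) * sigma_k_hat^2
    sigma2_hat = sum((np.sum(y == k) - 1) * x[y == k].var(ddof=1) for k in classes) / (n - K)
    return classes, pi_hat, mu_hat, sigma2_hat

def lda_predict(x0, classes, pi_hat, mu_hat, sigma2_hat):
    """Assign x0 to the class with the largest discriminant score delta_k(x0)."""
    delta = x0 * mu_hat / sigma2_hat - mu_hat**2 / (2 * sigma2_hat) + np.log(pi_hat)
    return classes[np.argmax(delta)]
```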
Theory says that the Bayes classifier, which classifies an observation to the class for which \(p_k(x)\) is largest, has the lowest possible error rate out of all classifiers.
Caveats (sources of error):
The assumption is that \(X = (X_1,X_2, \dots ,X_p)\) is drawn from a multivariate Gaussian distribution, with a class-specific mean vector and a common covariance matrix.
The multivariate Gaussian density is \[f(x) = \frac{1}{(2 \pi)^{p/2} |\Sigma|^{1/2}} e^{- \frac{1}{2} (x - \mu)^\intercal \Sigma^{-1} (x - \mu)}\]
Discriminant function: \[\delta_k(x) = x^\intercal \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\intercal \Sigma^{-1} \mu_k + \log \pi_k\] This can be written as: \[\delta_k(x) = a_{k0} + a_{k1} x_1 + \dots + a_{kp} x_p\] which is a linear function of \(x\).
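A sketch of this discriminant, assuming the estimates `mu` (a \(K \times p\) array of class means), `Sigma` (the common \(p \times p\) covariance) and `pi` (the \(K\) class priors) have already been computed from the training data:

```python
import numpy as np

def lda_discriminants(x, mu, Sigma, pi):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log pi_k, for each class k."""
    Sigma_inv = np.linalg.inv(Sigma)
    return np.array([
        x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)
        for mu_k, pi_k in zip(mu, pi)
    ])

# The predicted class is the one with the largest score:
# np.argmax(lda_discriminants(x, mu, Sigma, pi))
```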
Note that there are three lines representing the Bayes decision boundaries: with three classes there are three pairs of classes, and each pair has its own boundary.
With Gaussians but different \(\Sigma_k\) in each class, i.e. \(f_k(x) = \mathcal{N}_p(\mu_k, \Sigma_k)\), where \(\mathcal{N}_p(\mu, \Sigma)\) denotes the multivariate normal distribution, we get quadratic discriminant analysis (QDA); its discriminant is written out below.
With \(f_k(x) = \prod_{j=1}^p f_{jk}(x_j)\) (conditional independence model) in each class we get naive Bayes. In the Gaussian case this means the \(\Sigma_k\) are diagonal.
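For QDA, taking logs of \(\pi_k f_k(x)\) and dropping the terms that do not depend on \(k\) gives
\[\delta_k(x) = -\frac{1}{2} (x - \mu_k)^\intercal \Sigma_k^{-1} (x - \mu_k) - \frac{1}{2} \log |\Sigma_k| + \log \pi_k,\]
which is quadratic in \(x\) because the quadratic term \(x^\intercal \Sigma_k^{-1} x\) now differs across classes.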
For a two-class problem, one can show that for LDA
\[\log \left( \dfrac{p_1(x)}{1 - p_1(x)} \right) = c_0 + c_1 x_1 + \dots + c_p x_p\]
so it has the same form as logistic regression.
The difference is in how the parameters are estimated: logistic regression estimates its coefficients by maximizing the conditional likelihood based on \(\mathbb{P}(Y \mid X)\), while LDA plugs in estimates of \(\pi_k\), \(\mu_k\) and \(\Sigma\) obtained from the joint distribution of \(X\) and \(Y\).
Despite these differences, in practice the results are often very similar.
Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
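A sketch of this with scikit-learn (assuming it is available), on synthetic data built so that the true boundary is genuinely quadratic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic two-class data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0]**2 + X[:, 1]**2 > 1).astype(int)   # circular (quadratic) boundary

# Adding squared and interaction terms lets logistic regression fit a quadratic boundary
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LogisticRegression())
model.fit(X, y)
print(model.score(X, y))   # training accuracy
```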
Given a positive integer \(k\) and a test observation \(x\), the k-NN classifier first identifies the \(k\) training observations closest to \(x\); it then estimates \(\mathbb{P}(Y = j \mid X = x)\) by the fraction of those neighbours whose response equals \(j\), and classifies \(x\) to the class with the largest estimated probability.
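A minimal sketch of this rule with numpy, assuming training arrays `X_train` (\(n \times p\)) and `y_train` (class labels):

```python
import numpy as np
from collections import Counter

def knn_classify(x0, X_train, y_train, k):
    """Classify x0 by majority vote among its k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # distance from x0 to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    votes = Counter(np.asarray(y_train)[nearest])  # estimated P(Y = j | X = x0) is proportional to the counts
    return votes.most_common(1)[0][0]              # class with the largest estimated probability
```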